Conversation

@nuzant commented Oct 9, 2025

This pull request adds comprehensive support and documentation for running AReaL experiments with SkyPilot on cloud and Kubernetes infrastructures. It introduces example YAML configurations for both single-node and multi-node experiments, a detailed README for SkyPilot usage, and step-by-step installation instructions. These changes make it much easier to launch distributed AReaL experiments on GCP or Kubernetes using SkyPilot.

SkyPilot Integration and Documentation

  • Added a new section to docs/tutorial/installation.md with step-by-step instructions for installing and verifying SkyPilot, including GCP and Kubernetes setup guidance.
  • Created examples/skypilot/README.md providing detailed usage examples, explanations, and command lines for running AReaL experiments with SkyPilot, covering both single-node and multi-node setups.

Example Configurations for SkyPilot

  • Added examples/skypilot/local.yaml as a template for launching a single-node AReaL experiment with SkyPilot on GCP, specifying resources, storage, and launch commands.
  • Added examples/skypilot/ray_cluster.yaml for launching a multi-node Ray cluster with SkyPilot, including setup for distributed training and shared storage.
  • Added examples/skypilot/gsm8k_grpo_ray.yaml as a sample AReaL experiment configuration for Ray-based distributed training, detailing experiment parameters and resource allocation.

UPDATE: Separated examples and launcher into 2 PRs: #464

@nuzant changed the title from "Mzy/skypilot" to "[Feature] Add SkyPilot launcher and examples" on Oct 9, 2025

nuzant commented Oct 9, 2025

/gemini review


@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces first-class support for SkyPilot, enabling AReaL experiments to be run on cloud and Kubernetes infrastructure. The changes include a new SkyPilotLauncherConfig, the SkyPilotLauncher implementation, and extensive documentation and examples.

The overall implementation is solid and follows SkyPilot's best practices. The new launcher is well-structured, handling cluster provisioning, job submission, and state management correctly. The documentation is also comprehensive and will be very helpful for users.

I've found a few issues that should be addressed:

  • There are hardcoded network ports in the launcher, which could cause conflicts.
  • There's a bug in the calculation of trainer nodes, leading to incorrect resource allocation.
  • The example ray_cluster.yaml and its corresponding documentation contain a shell script with syntax errors and a logic bug that would cause worker nodes to terminate prematurely.

Addressing these points will improve the robustness and correctness of the SkyPilot integration. Great work on adding this powerful feature!

future launches.

```bash
sky volumes apply storage-volume.yaml
```

You need to make it clear where the user should execute the steps in this README from.
Here it assumes `examples/skypilot`, but later it assumes the root of the repo.

Collaborator Author

I think we could make the content about cloud buckets and volumes shorter and refer to the SkyPilot cloud bucket and volume guides.

Also, I have checked the other places to ensure that users can execute these commands from the root of the repo.

```yaml
/storage: areal-shared-storage
setup: |
  pip3 install -e .
```

Maybe worth creating a virtual env instead of installing with pip as root

Collaborator Author

Using the AReaL repo root directory as the workdir together with our image ensures that we do not need `pip install -e .` (or any other installation) before launching the experiment. Therefore, the `setup` section here is removed.


```bash
export WANDB_API_KEY=<your-wandb-api-key>
sky launch -c areal --secret WANDB_API_KEY examples/skypilot/ray_cluster.yaml
```

This command fails for me with this:

(head, rank=0, pid=4232) Executing training script on head node...
(worker1, rank=1, pid=3359, ip=10.170.27.163) Node setup complete for rank 1.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774        WARNING services.py:394 -- Found multiple active Ray instances: {'10.156.61.243:6380', '10.156.61.243:6379'}. Connecting to latest cluster at 10.156.61.243:6379. You can override this by setting the `--address` flag or `RAY_ADDRESS` environment variable.
(head, rank=0, pid=4232) 2025-10-11 02:34:06,774        INFO worker.py:1771 -- Connecting to existing Ray cluster at address: 10.156.61.243:6379...
(head, rank=0, pid=4232) 2025-10-11 02:34:06,785        INFO worker.py:1942 -- Connected to Ray cluster. View the dashboard at http://127.0.0.1:8265 
(head, rank=0, pid=4232) Traceback (most recent call last):
(head, rank=0, pid=4232)   File "<frozen runpy>", line 198, in _run_module_as_main
(head, rank=0, pid=4232)   File "<frozen runpy>", line 88, in _run_code
(head, rank=0, pid=4232)   File "/root/sky_workdir/areal/launcher/ray.py", line 591, in <module>
(head, rank=0, pid=4232)     main()
(head, rank=0, pid=4232)   File "/root/sky_workdir/areal/launcher/ray.py", line 330, in main
(head, rank=0, pid=4232)     config, _ = parse_cli_args(sys.argv[1:])
(head, rank=0, pid=4232)                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/root/sky_workdir/areal/api/cli_args.py", line 1308, in parse_cli_args
(head, rank=0, pid=4232)     cfg = hydra_compose(
(head, rank=0, pid=4232)           ^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/compose.py", line 38, in compose
(head, rank=0, pid=4232)     cfg = gh.hydra.compose_config(
(head, rank=0, pid=4232)           ^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/hydra.py", line 594, in compose_config
(head, rank=0, pid=4232)     cfg = self.config_loader.load_configuration(
(head, rank=0, pid=4232)           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 142, in load_configuration
(head, rank=0, pid=4232)     return self._load_configuration_impl(
(head, rank=0, pid=4232)            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 244, in _load_configuration_impl
(head, rank=0, pid=4232)     parsed_overrides, caching_repo = self._parse_overrides_and_create_caching_repo(
(head, rank=0, pid=4232)                                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/_internal/config_loader_impl.py", line 228, in _parse_overrides_and_create_caching_repo
(head, rank=0, pid=4232)     parsed_overrides = parser.parse_overrides(overrides=overrides)
(head, rank=0, pid=4232)                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(head, rank=0, pid=4232)   File "/usr/local/lib/python3.12/dist-packages/hydra/core/override_parser/overrides_parser.py", line 99, in parse_overrides
(head, rank=0, pid=4232)     raise OverrideParseException(
(head, rank=0, pid=4232) hydra.errors.OverrideParseException: mismatched input '=' expecting <EOF>
(head, rank=0, pid=4232) See https://hydra.cc/docs/1.2/advanced/override_grammar/basic for details
(head, rank=0, pid=4232) Node setup complete for rank 0.

Collaborator Author

It is caused by `+trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY"`. This is a limitation of Hydra, whose override grammar does not allow an unquoted `=` inside command-line argument values. Currently, users can only set environment variables in the YAML config file. We are looking for workarounds that let users set environment variables on the command line.

For now, I think we should just remove `WANDB_API_KEY` from the examples to keep them clear and runnable.
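For illustration, a minimal sketch of the quoting problem (using a dummy key; the `+launcher.trainer_env_vars` override name is taken from the reviewer's comment below — not verified against the launcher):

```shell
# Sketch of the quoting issue; 'dummy' stands in for a real key.
WANDB_API_KEY=dummy

# Broken form: after shell quote removal, hydra receives
#   trainer_env_vars=WANDB_API_KEY=dummy
# and the second '=' trips its override grammar.
broken_token=$(printf '%s' trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY")

# Reviewer's workaround: an extra quoting layer keeps the inner key=value pair
# inside one double-quoted hydra value.
fixed_token=$(printf '%s' '+launcher.trainer_env_vars="WANDB_API_KEY='"$WANDB_API_KEY"'"')

echo "$broken_token"
echo "$fixed_token"
```

The difference is only in what single token the shell hands to Hydra: the broken form exposes a bare second `=`, the fixed form keeps it inside quotes.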

```bash
--config examples/skypilot/gsm8k_grpo_ray.yaml \
experiment_name=<your experiment name> \
trial_name=<your trial name> \
trainer_env_vars="WANDB_API_KEY=$WANDB_API_KEY"
```

This is wrong and needs to be replaced with `'+launcher.trainer_env_vars="WANDB_API_KEY='$WANDB_API_KEY'"'`, otherwise it fails.

Collaborator Author

Same as above.

Comment on lines 123 to 127
If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with
SkyPilot. For more options and details for SkyPilot, see the official
[SkyPilot installation guide](https://docs.skypilot.co/en/latest/getting-started/installation.html).


We need to link to this page on how to configure K8s to work with SkyPilot: https://docs.skypilot.co/en/latest/reference/kubernetes/kubernetes-setup.html

Collaborator Author

Added reference in the Kubernetes setup section above.

resolve and distributed checkpointing. The following guideline shows how to use SkyPilot
volumes to set up high-performance shared storage.

1. **Define the volume.** Create a YAML file describing the volume you want SkyPilot to


While using volumes is fine, this is not required. And using cloud buckets could be simpler: https://docs.skypilot.co/en/latest/reference/storage.html

Collaborator Author

Added cloud bucket usage in the example.

If `GCP: enabled` or `Kubernetes: enabled` are shown, you're ready to use SkyPilot with
AReaL. Check [here](../examples/skypilot.md) for a detailed example to run AReaL with

This file doesn't exist

Collaborator Author

Here we changed this link to https://github.com/inclusionAI/AReaL/blob/main/examples/skypilot/README.md to ensure the link is available in our documentation pages after this PR is merged into main.

```bash
# Ensure your kubeconfig is at ~/.kube/config
mkdir -p ~/.kube
cp /path/to/kubeconfig ~/.kube/config
```

It's unclear where /path/to/kubeconfig comes from

Collaborator Author

Removed this and referred to the SkyPilot K8s setup guide instead.

```yaml
resources:
  accelerators: H100:8
  image_id: docker:ghcr.io/inclusionai/areal-runtime:v0.3.3
```

It looks like this image is configured to use a custom PyPI index https://pypi.antfin-inc.com/simple.
It doesn't work for me. Here's what I see:

(setup pid=4496) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496) Obtaining file:///root/sky_workdir
(setup pid=4496)   Installing build dependencies: started
(setup pid=3465, ip=10.170.27.38) Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=3465, ip=10.170.27.38) Obtaining file:///root/sky_workdir
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: started
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=3465, ip=10.170.27.38)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: still running...
(setup pid=4496)   Installing build dependencies: finished with status 'error'
(setup pid=4496)   error: subprocess-exited-with-error
(setup pid=4496)   
(setup pid=4496)   × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496)   │ exit code: 1
(setup pid=4496)   ╰─> [8 lines of output]
(setup pid=4496)       Looking in indexes: https://pypi.antfin-inc.com/simple
(setup pid=4496)       WARNING: Retrying (Retry(total=4, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80fda300>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=3, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca2d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca480>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca5d0>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'ConnectTimeoutError(<pip._vendor.urllib3.connection.HTTPSConnection object at 0x7fbd80bca780>, 'Connection to pypi.antfin-inc.com timed out. (connect timeout=100.0)')': /simple/setuptools/
(setup pid=4496)       ERROR: Could not find a version that satisfies the requirement setuptools>=61.0 (from versions: none)
(setup pid=4496)       ERROR: No matching distribution found for setuptools>=61.0
(setup pid=4496)       [end of output]
(setup pid=4496)   
(setup pid=4496)   note: This error originates from a subprocess, and is likely not a problem with pip.
(setup pid=4496) error: subprocess-exited-with-error
(setup pid=4496) 
(setup pid=4496) × pip subprocess to install build dependencies did not run successfully.
(setup pid=4496) │ exit code: 1
(setup pid=4496) ╰─> See above for output.
(setup pid=4496) 
(setup pid=4496) note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Job 1's setup failed with return code list: [137, 1]
✓ Job finished (status: FAILED_SETUP).
command terminated with exit code 100

Collaborator Author

It does not require any installation to run experiments now. However, the PyPI index is still custom in our public image. We will note this and fix the problem in our next image release.
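Until the image is fixed, one possible workaround (an assumption on our side, not tested against this image) is to point pip back at the public index via its standard environment variable, e.g. in the task's `setup` section:

```shell
# Override the image's custom index with the public PyPI index.
# PIP_INDEX_URL is pip's standard environment variable; no image changes needed.
export PIP_INDEX_URL=https://pypi.org/simple
echo "$PIP_INDEX_URL"
```

Any `pip install` run afterwards in the same shell would then resolve packages against pypi.org instead of the unreachable internal mirror.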

```bash
echo "Starting Ray head node..."
ray start --head --port=6379
while [ $(ray nodes | grep NODE_ID | wc -l) -lt $num_nodes ]; do
```

The `ray nodes` command doesn't exist:

(head, rank=0, pid=4484) Usage: ray [OPTIONS] COMMAND [ARGS]...
(head, rank=0, pid=4484) Try 'ray --help' for help.
(head, rank=0, pid=4484) 
(head, rank=0, pid=4484) Error: No such command 'nodes'.

Collaborator Author

Fixed this by using `ray status` instead.
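As a sketch of what the fixed wait loop could count on, the node-line format of `ray status` is assumed below; the counting logic is shown against a canned sample so it can run without Ray:

```shell
# Canned sample standing in for live `ray status` output (format assumed).
sample_status='======== Autoscaler status ========
Node status
---------------------------------------------------------------
Active:
 1 node_f00
 1 node_ba4
Pending:
 (no pending nodes)'

# In the real run script this would be: alive=$(ray status | grep -c '^ 1 node_')
# wrapped in: while [ "$alive" -lt "$num_nodes" ]; do sleep 5; ...; done
alive=$(printf '%s\n' "$sample_status" | grep -c '^ 1 node_')
echo "$alive"
```

If the `Active:` section format differs across Ray versions, the grep pattern would need adjusting accordingly.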

```bash
echo "Executing training script on head node..."
python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
  --config examples/skypilot/gsm8k_grpo_ray.yaml \
  experiment_name=<your experiment name> \
```

Use `envs` and `secrets` to set these as env vars:

```yaml
envs:
  EXPERIMENT_NAME: my-areal-experiment
  TRIAL_NAME: my-trial-name

secrets:
  WANDB_API_KEY: null
```

and then:

```bash
experiment_name=$EXPERIMENT_NAME \
trial_name=$TRIAL_NAME \
```

Collaborator Author

Done.

@alex000kim left a comment


I think this PR is a bit raw.
The training job doesn't run due to incorrect syntax in several places:

  • non-existent commands
  • incorrect parameters
  • etc.


nuzant commented Oct 13, 2025

I think this PR is a bit raw. The training job doesn't run due to incorrect syntax in several places:

  • non-existent commands
  • incorrect parameters
  • etc.

Thanks for your review! We have GCP access now and we will be able to test and debug this PR by ourselves. We will start fixing this PR right away.

Comment on lines 10 to 11

```yaml
n_nodes: 2
n_gpus_per_node: 1
```

Collaborator

Is n_nodes=2 correct? What should be the desired n_gpus_per_node?


What's the goal of using a 2-node cluster with 1 GPU on each node instead of a single node with 2x, 4x or even 8x GPUs?


I'd recommend using SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE instead. See https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html

Collaborator Author

I have changed the example to a setting with 2x 8-GPU nodes. `cluster.n_nodes` and `cluster.n_gpus_per_node` are now set from `$SKYPILOT_NUM_NODES` and `$SKYPILOT_NUM_GPUS_PER_NODE` in the `run` field of the SkyPilot YAML file.
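Based on the change described, the forwarding could look roughly like this in the SkyPilot task's `run` section (a sketch; the exact launch command and config paths are assumptions taken from the examples quoted elsewhere in this thread):

```yaml
run: |
  python3 -m areal.launcher.ray examples/math/gsm8k_grpo.py \
    --config examples/skypilot/gsm8k_grpo_ray.yaml \
    experiment_name=$EXPERIMENT_NAME \
    trial_name=$TRIAL_NAME \
    cluster.n_nodes=$SKYPILOT_NUM_NODES \
    cluster.n_gpus_per_node=$SKYPILOT_NUM_GPUS_PER_NODE
```

This way the AReaL cluster spec always tracks whatever `num_nodes` and accelerator count the user requested at `sky launch` time.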

```bash
--config examples/math/gsm8k_grpo.yaml \
experiment_name=gsm8k-grpo \
trial_name=trial0 \
cluster.n_gpus_per_node=2 \
```

Collaborator

Move this to yaml, or specify both n_nodes and n_gpus_per_node.


I'd recommend using SKYPILOT_NUM_NODES and SKYPILOT_NUM_GPUS_PER_NODE instead. See https://docs.skypilot.co/en/latest/running-jobs/environment-variables.html

Collaborator Author

Same as above.

```yaml
run: |
  # Get the Head node's IP and total number of nodes (environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)
```

Collaborator Author

`num_nodes` is now replaced by `$SKYPILOT_NUM_NODES`.

### Running AReaL with Ray Launcher

The following example shows how to set up a Ray cluster with SkyPilot and then use AReaL
to run GRPO on the GSM8K dataset on 2 nodes, each with 1 A100 GPU. This example runs on


What's the point of using "2 nodes, each with 1 A100 GPU" instead of a single node with several GPUs?
This training will be much slower due to the slow interconnectivity speed between nodes.


It should be fine as an MVP to show distributed training :)

Collaborator Author

Its original purpose was to demonstrate distributed training. I have changed the example to a more practical setting with 2x 8-GPU nodes.

```yaml
file_mounts:
  /storage: gs://areal-default
```

Instead of hard-coding gs, can we use something like

```yaml
file_mounts:
  /my_data:
    source: s3://my-bucket/  # or gs://, https://<azure_storage_account>.blob.core.windows.net/<container>, r2://, cos://<region>/<bucket>, oci://<bucket_name>
    mode: MOUNT  # MOUNT or COPY or MOUNT_CACHED. Defaults to MOUNT. Optional.
```

as per https://docs.skypilot.co/en/latest/reference/storage.html

Collaborator Author

Done.

```yaml
run: |
  # Get the Head node's IP and total number of nodes (environment variables injected by SkyPilot).
  head_ip=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  num_nodes=$(echo "$SKYPILOT_NODE_IPS" | wc -l)
```

replace with an env var

Collaborator Author

Done

@@ -0,0 +1,42 @@

```yaml
resources:
  infra: gcp
```

Let's not hard-code GCP and remove the infra here altogether?
One of the key value props of SkyPilot is the ability to run the same workload on different clouds and k8s.
So the users are able to do this:

```bash
sky launch -c mycluster sky.yaml --infra aws
sky launch -c mycluster sky.yaml --infra gcp
sky launch -c mycluster sky.yaml --infra k8s
```

Collaborator Author

I have removed the hard-coded GCP infra in the YAML files and added a note on how to use different clouds and K8s.


## (Optional) Install SkyPilot

SkyPilot helps you run AReaL easily on cloud or Kubernetes infrastructures. Below shows


I'd add a link to SkyPilot docs + mention that it supports 17+ clouds

Collaborator Author

Of course!


alex000kim commented Oct 20, 2025

I'd recommend renaming examples/skypilot/local.yaml to examples/skypilot/local.sky.yaml and examples/skypilot/ray_cluster.yaml to examples/skypilot/ray_cluster.sky.yaml.
Otherwise, it can be confusing to distinguish which YAMLs are SkyPilot tasks and which are AReaL configs.

Also, "local" in local.yaml is confusing since it doesn't run locally.

```yaml
num_nodes: 1

file_mounts:
  /storage: gs://areal-default
```

It might be worth pointing out that /storage is set in examples/skypilot/gsm8k_grpo_ray.yaml

Collaborator Author

Added a comment to clarify this.

@alex000kim

I confirm that I was able to run examples/skypilot/local.yaml on K8s

@alex000kim

For me, examples/skypilot/ray_cluster.yaml only worked after adding

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
        - securityContext:
            capabilities:
              add:
              - IPC_LOCK
```

but it might be specific to my cluster since the nodes are connected via infiniband.
The rest of it looks good.


nuzant commented Oct 21, 2025

I'd recommend renaming examples/skypilot/local.yaml to examples/skypilot/local.sky.yaml and examples/skypilot/ray_cluster.yaml to examples/skypilot/ray_cluster.sky.yaml. Otherwise, it can be confusing trying to distinguish which yamls are SkyPilot tasks and which ones are areal configs.

Also, "local" in local.yaml is confusing since it doesn't run locally.

Great suggestion! Changed local.yaml to single_node.sky.yaml and ray_cluster.yaml to ray_cluster.sky.yaml.


nuzant commented Oct 21, 2025

For me, examples/skypilot/ray_cluster.yaml only worked after adding

```yaml
config:
  kubernetes:
    pod_config:
      spec:
        containers:
        - securityContext:
            capabilities:
              add:
              - IPC_LOCK
```

but it might be specific to my cluster since the nodes are connected via infiniband. The rest of it looks good.

Added a note on the additional config needed when using a cluster with InfiniBand.


nuzant commented Oct 21, 2025

Thanks for your review! Please check again if recent changes have addressed your comments. @alex000kim @garrett4wade
